The first step we need to take in order to apply distributional semantics to our texts is to design a script that counts the number of co-occurrences for each co-occurrent $c$ within a certain range of each target word $t$. This notebook will lead you through this process step by step.
First, we need to write a function that takes as input the complete file path of a text file, breaks the text down into an ordered list of words, and saves it as, well, a list. You did this in exercise 1b, so you should re-use your code as much as possible here.


In [1]:
from string import punctuation
import re

def txt_to_list(filename):
    # insert your code here
    with open(filename) as f:
        words = [word.lower() for word in f.read().split()]
    tokens = []
    for word in words:
        # split on runs of punctuation and discard the empty strings this produces
        tokens.extend(w for w in re.split('[%s]+' % punctuation, word)
                      if w != '')
    return tokens

# Test your code on this short text.  Make sure to look at the results!
tokens = txt_to_list('austen-emma-excerpt.txt')
print(tokens)


['emma', 'by', 'jane', 'austen', '1816', 'volume', 'i', 'chapter', 'i', 'emma', 'woodhouse', 'handsome', 'clever', 'and', 'rich', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of', 'existence', 'and', 'had', 'lived', 'nearly', 'twenty', 'one', 'years', 'in', 'the', 'world', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', 'she', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate', 'indulgent', 'father', 'and', 'had', 'in', 'consequence', 'of', 'her', 'sister', 's', 'marriage', 'been', 'mistress', 'of', 'his', 'house', 'from', 'a', 'very', 'early', 'period', 'her', 'mother', 'had', 'died', 'too', 'long', 'ago', 'for', 'her', 'to', 'have', 'more', 'than', 'an', 'indistinct', 'remembrance', 'of', 'her', 'caresses', 'and', 'her', 'place', 'had', 'been', 'supplied', 'by', 'an', 'excellent', 'woman', 'as', 'governess', 'who', 'had', 'fallen', 'little', 'short', 'of', 'a', 'mother', 'in', 'affection']

OK, that should have been easy. The next easy step is to write another short function that takes the list returned by the txt_to_list function and produces a dictionary where the keys are the individual types in the text and the values are the total counts of that type in the text. Such a count dictionary will be necessary for our later calculations.

Hint: Using Counter here will simplify the task significantly.


In [2]:
from collections import Counter
def make_count_dict(l):
    # insert your code here
    return Counter(l)

# Below tests your code.  Again, make sure to check your results.
count_dict = make_count_dict(tokens)
print(count_dict)


Counter({'of': 8, 'her': 6, 'and': 5, 'had': 5, 'the': 4, 'a': 4, 'to': 3, 'in': 3, 'been': 2, 'very': 2, 'an': 2, 'by': 2, 'little': 2, 'with': 2, 'mother': 2, 'emma': 2, 'i': 2, 'best': 1, 'more': 1, 'long': 1, 'volume': 1, 'died': 1, 'blessings': 1, 'daughters': 1, 'consequence': 1, 'indulgent': 1, 'place': 1, 'fallen': 1, 'most': 1, 'disposition': 1, 'austen': 1, 's': 1, 'excellent': 1, 'remembrance': 1, 'jane': 1, 'as': 1, 'father': 1, 'seemed': 1, 'ago': 1, 'vex': 1, 'nearly': 1, 'sister': 1, 'governess': 1, 'years': 1, 'clever': 1, 'youngest': 1, 'woodhouse': 1, 'indistinct': 1, 'period': 1, '1816': 1, 'lived': 1, 'mistress': 1, 'was': 1, 'who': 1, 'from': 1, 'for': 1, 'than': 1, 'caresses': 1, 'house': 1, 'marriage': 1, 'too': 1, 'one': 1, 'twenty': 1, 'home': 1, 'affection': 1, 'distress': 1, 'short': 1, 'rich': 1, 'woman': 1, 'handsome': 1, 'comfortable': 1, 'existence': 1, 'or': 1, 'she': 1, 'supplied': 1, 'early': 1, 'affectionate': 1, 'chapter': 1, 'some': 1, 'unite': 1, 'two': 1, 'happy': 1, 'his': 1, 'have': 1, 'world': 1})

Now, the next step will be a bit more complex. We want to write a function that takes as input a token list and a window size and returns a dictionary of dictionaries: the outer keys are the target words $t$ (the types in your token list), and each value is itself a dictionary whose keys are the co-occurrents $c$ (again, the types in your list) and whose values are the number of times that $c$ occurs within the window around $t$. Mathematically, this count is $n(c,t)$.

In the end, your dictionary should look something like this: {'the': {'the': 1000, 'aardvark': 8, 'be': 100...}}.

Hint: Consider using a defaultdict and a Counter for this.


In [3]:
from collections import defaultdict

def make_cooc_dict(l, window_size=4):
    '''Takes as input a token list and a window size (default == 4).
    The window size is the distance in words both left and right from the target word.
    For instance, if you want 4 words left and 4 words right of your target word, window_size = 4.
    '''
    # insert your code here
    d = defaultdict(Counter)
    for i, word in enumerate(l):
        # count every word in the window, excluding the target position itself
        window = l[max(i - window_size, 0):i] + l[i + 1:i + window_size + 1]
        d[word] += Counter(window)
    return d

# Below tests your code.  Check your results.
cooc_dict = make_cooc_dict(tokens, window_size = 4)
#the following lines check to make sure that your cooc_dict is symmetrical
problems = []
for t in cooc_dict:
    for c in cooc_dict[t]:
        if cooc_dict[t][c] != cooc_dict[c][t]:
            problems.append((c, t))
print(problems)
#the following line checks one tough case
cooc_dict['i'] == Counter({'chapter': 2, 'volume': 2, '1816': 2, 'woodhouse': 2, 'emma': 2, 'i': 2, 
                           'handsome': 1, 'austen': 1, 'jane': 1, 'clever': 1})


[]
Out[3]:
True

We have been using dictionaries up to this point instead of Pandas Series and DataFrame objects because the former are much more memory efficient than the latter. We should only switch over to Pandas objects when we want to start vectorizing our calculations. This is the point at which the increased memory drain of the Pandas objects pays for itself in speed!
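As a quick illustration of that switch (with toy words and counts, not values taken from the Emma excerpt): a flat count dictionary maps naturally onto a Series, while a nested co-occurrence dictionary maps onto a DataFrame, where missing word pairs become NaN and should be filled with 0, since a missing pair simply never co-occurred.

```python
from collections import Counter

import pandas as pd

# Toy dictionaries in the same shapes our functions produce.
count_dict = Counter({'the': 3, 'cat': 1})
cooc_dict = {'the': {'cat': 1, 'sat': 1}, 'cat': {'the': 1}}

# A flat count dict becomes a Series indexed by type.
counts = pd.Series(count_dict)

# A dict of dicts becomes a DataFrame: outer keys are columns,
# inner keys are rows; unseen pairs become NaN, so fill them with 0.
cooc = pd.DataFrame(cooc_dict).fillna(0)

print(counts['the'])           # 3
print(cooc.loc['cat', 'the'])  # 1.0
print(cooc.loc['sat', 'cat'])  # 0.0
```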

Quiz:

Now it is time to put your functions to the test with your own texts. If you did not bring your own texts to the summer school with you, use the texts in the Data folder for lesson 1b. Below, write a script that goes to the folder on your computer where your text files are, returns a list of the names of all the .txt files in that folder (Hint: check out the os.listdir() function for this), and then runs each text through each of the functions you wrote above. Finally, convert both of your dictionaries into Pandas objects (you decide which type is best for each dictionary) and save them as .pickle files using the df.to_pickle() method.

As background, pickle serializes your Python objects, which basically means that it saves them to disk as Python objects, e.g., it will save your dictionaries as Python dictionary objects. This is typically more efficient, both in disk storage space and in processing time, when saving the objects and reloading them back into Python.
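A minimal sketch of that round trip for a Pandas object (the file name and data here are purely illustrative):

```python
import os
import tempfile

import pandas as pd

# A small Series standing in for one of our count dictionaries.
s = pd.Series({'emma': 2, 'woodhouse': 1})

# Serialize it to disk, then load it straight back in.
path = os.path.join(tempfile.gettempdir(), 'counts.pickle')
s.to_pickle(path)
restored = pd.read_pickle(path)

print(restored.equals(s))  # True: the reloaded object is identical
```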

Hint: You might also want to check out the tkinter.filedialog functions. They open a file-open or file-save dialog so that you can choose the files you want to work with on the fly. They are great tools for easing file interaction and for generalizing the code you write across different purposes and operating systems.

Hint #2: If you are running out of memory when producing your Pandas objects or pickling them, try deleting objects that you no longer need with del. For instance, once you have run the make_count_dict and make_cooc_dict functions, you don't need tokens anymore. So type:

del tokens

Do the same with your dictionaries once you have converted them to Pandas objects and your Pandas objects once you have pickled them.


In [4]:
from glob import glob
from os.path import basename
import pandas

def process():
    for filename in glob('./Data/*.txt'):
        print(filename)
        tokens = txt_to_list(filename)
        name = basename(filename)[:-4]  # strip the .txt extension
        pandas.Series(make_count_dict(tokens)).to_pickle('./Data/%s.count.pickle' % name)
        pandas.DataFrame(make_cooc_dict(tokens)).fillna(0).to_pickle('./Data/%s.cooc.pickle' % name)
        del tokens  # free memory before moving on to the next text

process()


./Data/bryant-stories.txt
./Data/edgeworth-parents.txt
./Data/austen-emma.txt
./Data/austen-pride.txt
./Data/austen-sense.txt
./Data/blake-poems.txt
./Data/blake-songs.txt
./Data/burgess-busterbrown.txt
./Data/carroll-alice.txt
./Data/chesterton-ball.txt
./Data/chesterton-thursday.txt
./Data/melville-piazza.txt
./Data/milton-paradise.txt
./Data/shakespeare-caesar.txt
./Data/shakespeare-hamlet.txt
./Data/whitman-leaves.txt
./Data/whitman-patriotic.txt
./Data/whitman-poems.txt